feat(duplicates): body_hash structural duplication detection #178

Merged
SutuSebastian merged 4 commits into main from feat/body-hash-duplication on Jun 10, 2026

@SutuSebastian

@SutuSebastian SutuSebastian commented Jun 10, 2026

Contributor

Summary

  • Add symbols.body_hash at index time (canonical function-body AST: identifiers → $id, literals → kind) for function-shaped symbols, with SCHEMA_VERSION 39 and partial index idx_symbols_body_hash.
  • Ship bundled duplicates recipe (per-symbol rows + duplicate_count via CTE) and agent rule trigger for structural duplicate discovery.
  • Retire docs/plans/ast-hash-duplication.md into architecture, glossary, golden-queries, and roadmap; includes changeset for minor release.
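
The normalization named in the first bullet (identifiers → $id, literals → kind) can be sketched roughly as follows. This is a minimal illustration, not the code in src/extractors/body-hash.ts: the `AstNode` shape, the `canonicalize` name, and the output string format are all assumptions made up for this example.

```typescript
// Hypothetical, simplified AST node; the real extractor walks the parser's AST.
interface AstNode {
  type: string;
  name?: string; // present on Identifier nodes
  value?: unknown; // present on Literal nodes
  operator?: string; // present on operator nodes
  children?: AstNode[];
}

// Depth-first serialization that erases identifier names and literal values,
// keeping only node types, operators, and the literal's kind.
function canonicalize(node: AstNode): string {
  if (node.type === "Identifier") return "$id";
  if (node.type === "Literal") {
    const kind = node.value === null ? "null" : typeof node.value;
    return `Literal:${kind}`;
  }
  const op = node.operator ? `:${node.operator}` : "";
  const children = (node.children ?? []).map(canonicalize).join(",");
  return `${node.type}${op}(${children})`;
}

// `return a + 1` and `return total + 42` differ only in names and values.
const bodyA: AstNode = {
  type: "ReturnStatement",
  children: [{
    type: "BinaryExpression",
    operator: "+",
    children: [{ type: "Identifier", name: "a" }, { type: "Literal", value: 1 }],
  }],
};
const bodyB: AstNode = {
  type: "ReturnStatement",
  children: [{
    type: "BinaryExpression",
    operator: "+",
    children: [{ type: "Identifier", name: "total" }, { type: "Literal", value: 42 }],
  }],
};
// `return a + "x"` has a different literal kind, so it should not match.
const bodyC: AstNode = {
  type: "ReturnStatement",
  children: [{
    type: "BinaryExpression",
    operator: "+",
    children: [{ type: "Identifier", name: "a" }, { type: "Literal", value: "x" }],
  }],
};

console.log(canonicalize(bodyA) === canonicalize(bodyB)); // true
console.log(canonicalize(bodyA) === canonicalize(bodyC)); // false
```

Because only structure survives, two functions that were copy-pasted and then renamed still collapse to the same canonical string, while a change in literal kind or operator keeps them apart.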

Test plan

  • bun test src/extractors/body-hash.test.ts (11 tests: function declarations, arrows, methods, getters, templates)
  • bun run typecheck
  • CODEMAP_ROOT=fixtures/minimal bun scripts/query-golden.ts (includes new duplicates scenario)
  • bun test scripts/query-golden-coverage-matrix.test.mjs
  • Pre-commit hook on commit

Summary by CodeRabbit

  • New Features

    • Structural duplicate function detection using canonical AST analysis
    • New duplicates recipe for identifying identical code bodies across files
  • Documentation

    • Updated architecture and glossary with duplicate detection details
    • Added recipe documentation with query parameters and triaging guidance
  • Fixtures

    • Added benchmark files and golden fixtures for duplicate detection testing

Add symbols.body_hash (canonical body AST for function-shaped symbols),
SCHEMA_VERSION 39, duplicates recipe, golden scenario, and agent rule row.
Retire ast-hash-duplication plan into architecture/glossary/golden-queries.
@changeset-bot

changeset-bot Bot commented Jun 10, 2026

🦋 Changeset detected

Latest commit: e5e289b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @stainless-code/codemap | Minor |

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Warning

Review limit reached

@SutuSebastian, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a06ae370-4bf4-4324-929e-9718350473c7

📥 Commits

Reviewing files that changed from the base of the PR and between fa1cd93 and e5e289b.

📒 Files selected for processing (14)
  • .changeset/ast-hash-duplication.md
  • docs/architecture.md
  • docs/glossary.md
  • docs/golden-queries.md
  • docs/roadmap.md
  • scripts/agent-eval/scenarios.json
  • scripts/duplicates-recipe-scope.test.mjs
  • scripts/spike-crap-reachability.test.mjs
  • src/extractors/body-hash.test.ts
  • src/extractors/body-hash.ts
  • src/extractors/symbols.ts
  • src/extractors/types.ts
  • templates/recipes/duplicates.md
  • templates/recipes/duplicates.sql
📝 Walkthrough

This PR adds structural duplicate detection to Codemap by computing a canonical SHA-256 hash of function bodies at index time, storing it in a new symbols.body_hash column, and providing a duplicates recipe to find and group identical bodies across the codebase. The implementation includes schema updates, a deterministic AST canonicalization algorithm, visitor integration, SQL template, and comprehensive documentation and test fixtures.
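
The hashing step described above might look roughly like this. It is a sketch under assumptions: `hashCanonicalBody` is a hypothetical name and signature (the PR names `hashFunctionBody` but does not show it here), and the canonical string is taken as already computed. Only the SHA-256 hex digest and the `body_line_count < 2` skip rule come from this PR.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: returns the SHA-256 hex digest of a canonical body
// string, or null for trivial bodies so that body_hash stays NULL in the index.
function hashCanonicalBody(canonical: string, bodyLineCount: number): string | null {
  if (bodyLineCount < 2) return null; // trivial bodies are skipped
  return createHash("sha256").update(canonical).digest("hex");
}

const oneLiner = hashCanonicalBody("ReturnStatement($id)", 1);
const multiLine = hashCanonicalBody("IfStatement($id,ReturnStatement($id))", 4);

console.log(oneLiner); // null
console.log(multiLine?.length); // 64 (hex characters)
```

Returning null for short bodies keeps one-line getters and trivial wrappers out of the duplicate groups, and it fits a partial index that only covers rows with a non-null hash.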

Changes

Structural Duplicate Detection

| Layer / File(s) | Summary |
| --- | --- |
| Database schema and storage (`src/db.ts`, `.changeset/ast-hash-duplication.md`) | Schema version bumped to 39; symbols table adds body_hash nullable column and partial index idx_symbols_body_hash; SymbolRow interface extended with optional `body_hash?: string…` |
| Body hash canonicalization and computation (`src/extractors/body-hash.ts`, `src/extractors/body-hash.test.ts`) | canonicalizeBody performs deterministic depth-first serialization, normalizing identifiers to $id, literals to kind-only, and sorting object keys; hashFunctionBody wraps it to return SHA-256 hex for non-trivial bodies (skipping body_line_count < 2); comprehensive tests verify isomorphism across functions, arrow functions, class members, and template literals. |
| Body hash extractor and pipeline integration (`src/extractors/body-hash.ts`, `src/parser.ts`) | bodyHashExtractor registers visitor handlers for function AST nodes (FunctionDeclaration:exit, ArrowFunctionExpression:exit, FunctionExpression:exit) to compute and assign body_hash to matching symbols; integrated into parser extractor chain after complexityExtractor. |
| Duplicates recipe and usage (`templates/recipes/duplicates.sql`, `templates/recipes/duplicates.md`, `templates/agent-content/rule/00-full.md`) | SQL query groups symbols by non-null body_hash, filters by minimum duplicate count and optional path/body-length constraints, returns up to 50 results ordered by frequency; recipe markdown documents configurable parameters, usage examples, and false-positive caveats; trigger pattern added to agent rule templates. |
| Documentation (`docs/architecture.md`, `docs/glossary.md`, `docs/golden-queries.md`, `docs/roadmap.md`) | Schema documentation describes body_hash purpose and population rules; glossary defines SHA-256 of canonicalized AST with normalization details; golden-queries section specifies output columns and fixture sources; roadmap updated with specific normalization and indexing approach. |
| Test fixtures and benchmarks (`fixtures/minimal/src/bench/duplicate-body-*.ts`, `fixtures/golden/minimal/*`, `fixtures/CAPABILITIES.json`, `fixtures/golden/scenarios.json`) | New benchmark functions duplicateAlpha and duplicateBeta with identical conditional bodies; 14 golden JSON fixtures updated with new benchmark entries and counts; new duplicates.json golden fixture listing detected duplicates with metadata; scenario and capability entries registered. |
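
As a rough in-memory analogue of what the duplicates recipe is described as doing (per-symbol rows plus a duplicate_count, minimum count filter, ordered by frequency, capped at 50), consider the sketch below. The actual recipe is SQL in templates/recipes/duplicates.sql and is not reproduced on this page; the `SymbolRow` fields other than `body_hash`, the `findDuplicates` name, the default `minCount` of 2, and the tie-break by path are assumptions for illustration.

```typescript
// Hypothetical row shape; only body_hash and duplicate_count are named in the PR.
interface SymbolRow {
  name: string;
  file_path: string;
  body_hash: string | null;
}
interface DuplicateRow extends SymbolRow {
  duplicate_count: number;
}

function findDuplicates(rows: SymbolRow[], minCount = 2, limit = 50): DuplicateRow[] {
  // Count symbols per non-null hash (the role the CTE plays in the SQL recipe).
  const counts = new Map<string, number>();
  for (const row of rows) {
    if (row.body_hash === null) continue;
    counts.set(row.body_hash, (counts.get(row.body_hash) ?? 0) + 1);
  }
  // Emit one row per symbol whose hash group is large enough.
  return rows
    .filter((row) => row.body_hash !== null && (counts.get(row.body_hash) ?? 0) >= minCount)
    .map((row) => ({ ...row, duplicate_count: counts.get(row.body_hash as string) ?? 0 }))
    .sort((a, b) => b.duplicate_count - a.duplicate_count || a.file_path.localeCompare(b.file_path))
    .slice(0, limit);
}

const sampleRows: SymbolRow[] = [
  { name: "first", file_path: "src/a.ts", body_hash: "h1" },
  { name: "second", file_path: "src/b.ts", body_hash: "h1" },
  { name: "unique", file_path: "src/c.ts", body_hash: "h2" },
  { name: "trivial", file_path: "src/d.ts", body_hash: null },
];

console.log(findDuplicates(sampleRows).map((r) => `${r.name}:${r.duplicate_count}`));
// [ "first:2", "second:2" ]
```

Emitting one row per symbol rather than one row per hash lets a reader jump straight to each duplicated function while still seeing how large its group is.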

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Poem

A rabbit hops through AST trees so tall,
Counting bodies that look the same to all,
SHA-256 marks the duplicates found,
While sorted keys make their hash sound,
Thump thump—the recipe grounds them down! 🐰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'feat(duplicates): body_hash structural duplication detection' clearly describes the main feature added: body_hash-based structural duplication detection. It accurately summarizes the core change across all modified files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Fix deferred correctness/perf items: return-position Literal:nullish for
null/undefined/void 0/bare return; void 0 only (not all void); FD symbol
index via markArrowSymbol at push time; docs parity and regression tests.
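
The return-position rule in this commit message (null, undefined, void 0, and a bare return all collapse to Literal:nullish, while other void expressions do not) could be expressed with a predicate like the one below. This is a sketch only: the `Expr` shape loosely follows ESTree naming, and `isNullishReturnArgument` is a hypothetical name, not a function confirmed to exist in the codebase.

```typescript
// Hypothetical, ESTree-like expression node for illustration.
interface Expr {
  type: string;
  name?: string; // Identifier
  value?: unknown; // Literal
  operator?: string; // UnaryExpression
  argument?: Expr | null; // UnaryExpression operand
}

// True when a return statement's argument should canonicalize to Literal:nullish.
function isNullishReturnArgument(arg: Expr | null | undefined): boolean {
  if (arg == null) return true; // bare `return;`
  if (arg.type === "Literal" && arg.value === null) return true; // `return null`
  if (arg.type === "Identifier" && arg.name === "undefined") return true; // `return undefined`
  if (arg.type === "UnaryExpression" && arg.operator === "void") {
    const inner = arg.argument;
    // Only `void 0` counts; `void sideEffect()` keeps its own structure.
    return inner != null && inner.type === "Literal" && inner.value === 0;
  }
  return false;
}

const voidZero: Expr = {
  type: "UnaryExpression",
  operator: "void",
  argument: { type: "Literal", value: 0 },
};
const voidCall: Expr = {
  type: "UnaryExpression",
  operator: "void",
  argument: { type: "CallExpression" },
};

console.log(isNullishReturnArgument(null)); // true
console.log(isNullishReturnArgument(voidZero)); // true
console.log(isNullishReturnArgument(voidCall)); // false
```

Restricting the rule to `void 0` avoids hashing a function that evaluates a call for its side effects the same as one that simply returns nothing.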

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/golden-queries.md`:
- Line 83: Update the documentation for the `duplicates` query and
`symbols.body_hash` population: replace the narrow phrase "named functions,
arrows, and class methods" with the precise contract "function-shaped symbols
(function, method, getter, setter)" and note the `body_hash` is set at index for
those symbol kinds when `body_line_count >= 2`; ensure the term "function-shaped
symbols" is used consistently in the `duplicates` description and any other
mentions of `symbols.body_hash` so readers understand getter/setter coverage.

In `@templates/recipes/duplicates.sql`:
- Around line 5-10: Duplicate grouping is done before applying scope filters;
move scope filters into the input set to the grouping (e.g., build a
filtered_symbols CTE or subquery selecting from symbols with the params filters
such as path_prefix and min_body_lines) so that the GROUP BY on body_hash and
the HAVING (min_count from params) operate only on the scoped rows. Locate the
symbols table usage and replace the direct GROUP BY on symbols.body_hash with
grouping over the filtered_symbols result (preserving references to body_hash
and params), and ensure any later joins or WHEREs that reference
path_prefix/min_body_lines are applied inside that filtered source rather than
after aggregation.
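
To see why the second comment matters, here is a small in-memory illustration of the two orderings. It does not reproduce the SQL recipe; the `Row` shape, the `countByHash` helper, and the sample paths are hypothetical, and only the parameter names path_prefix and min_body_lines come from the comment above.

```typescript
// Hypothetical row shape for the illustration.
interface Row {
  path: string;
  body_hash: string;
  body_line_count: number;
}

function countByHash(rows: Row[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    counts.set(row.body_hash, (counts.get(row.body_hash) ?? 0) + 1);
  }
  return counts;
}

const allRows: Row[] = [
  { path: "src/a.ts", body_hash: "h1", body_line_count: 5 },
  { path: "src/b.ts", body_hash: "h1", body_line_count: 5 },
  { path: "vendor/c.ts", body_hash: "h1", body_line_count: 5 },
];

// Scope filter standing in for the path_prefix and min_body_lines params.
const pathPrefix = "src/";
const minBodyLines = 2;
const inScope = (row: Row): boolean =>
  row.path.startsWith(pathPrefix) && row.body_line_count >= minBodyLines;

// Aggregate first, filter later: the count still includes the vendor copy.
const countBeforeScoping = countByHash(allRows).get("h1");
// Filter first, then aggregate: the count reflects only the scoped rows.
const countAfterScoping = countByHash(allRows.filter(inScope)).get("h1");

console.log(countBeforeScoping); // 3
console.log(countAfterScoping); // 2
```

With aggregation first, a query scoped to src/ would report a group of three even though only two of those symbols are in scope, which is the inaccuracy the comment asks to remove.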

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0a85f5a-1c09-4fca-8725-6a79c6791a24

📥 Commits

Reviewing files that changed from the base of the PR and between bb8bae9 and fa1cd93.

📒 Files selected for processing (30)
  • .changeset/ast-hash-duplication.md
  • docs/architecture.md
  • docs/glossary.md
  • docs/golden-queries.md
  • docs/plans/ast-hash-duplication.md
  • docs/roadmap.md
  • fixtures/CAPABILITIES.json
  • fixtures/golden/minimal/barrel-files.json
  • fixtures/golden/minimal/coverage-confirmed-dead-no-ingest.json
  • fixtures/golden/minimal/coverage-confirmed-dead.json
  • fixtures/golden/minimal/duplicates.json
  • fixtures/golden/minimal/files-count.json
  • fixtures/golden/minimal/files-hashes.json
  • fixtures/golden/minimal/index-summary.json
  • fixtures/golden/minimal/index-table-stats.json
  • fixtures/golden/minimal/refactor-risk-ranking.json
  • fixtures/golden/minimal/source-fts-row-count.json
  • fixtures/golden/minimal/unimported-exports.json
  • fixtures/golden/minimal/untested-and-dead.json
  • fixtures/golden/minimal/worst-covered-exports.json
  • fixtures/golden/scenarios.json
  • fixtures/minimal/src/bench/duplicate-body-a.ts
  • fixtures/minimal/src/bench/duplicate-body-b.ts
  • src/db.ts
  • src/extractors/body-hash.test.ts
  • src/extractors/body-hash.ts
  • src/parser.ts
  • templates/agent-content/rule/00-full.md
  • templates/recipes/duplicates.md
  • templates/recipes/duplicates.sql
💤 Files with no reviewable changes (1)
  • docs/plans/ast-hash-duplication.md

Comment thread docs/golden-queries.md Outdated
Comment thread templates/recipes/duplicates.sql
Apply path_prefix and min_body_lines before duplicate_count aggregation
so scoped queries report accurate group sizes. Align golden-queries
wording with function-shaped symbol contract (getter/setter included).
Wire duplication.body-hash agent-eval probe; fix spike-crap tier count
after fixture symbols; add duplicates-recipe-scope regression tests;
setter/nullish-scope unit tests; consumer doc caveats (LIMIT, async/gen).
@SutuSebastian SutuSebastian merged commit 36106ff into main Jun 10, 2026
11 checks passed
@SutuSebastian SutuSebastian deleted the feat/body-hash-duplication branch June 10, 2026 13:12
@github-actions github-actions Bot mentioned this pull request Jun 10, 2026